# The goal of the project

### Based on the data about user and expert reviews, genres, platforms, and historical data on game sales, identify patterns that determine whether a game is successful or not, in order to plan the advertising campaigns for the online store Ice.

### Table of Contents

* [Step 1. Open the data file and study the general information](#1)
* [Step 2. Prepare the data](#2)
    * [2.1. Renaming columns, dropping missing values and converting the data to the required types](#2_1)
    * [2.2. Add column 'total sales'](#2_2)
    * [2.3. Fill in missing values](#2_3)
* [Step 3. Analyze the data](#3)
    * [3.1. Total sales by years](#3_1)
    * [3.2. Sales by platforms](#3_2)
    * [3.3. Research of selected platforms](#3_3)
    * [3.4. The global sales of all games, broken down by platform](#3_4)
    * [3.5. User and professional reviews affect sales for one popular platform X360](#3_5)
    * [3.6. Distribution of games by genre](#3_6)
* [Step 4. Create a user profile for each region](#4)
    * [4.1. User profile for North American region](#4_1)
    * [4.2. User profile for Europe region](#4_2)
    * [4.3. User profile for Japan region](#4_3)
* [Step 5. Test the hypotheses](#5)
    * [5.1. Average user ratings of the Xbox One and PC platforms are the same](#5_1)
    * [5.2. Average user ratings for the Action and Sports genres are different](#5_2)
* [General Conclusion](#6)

## Step 1. Open the data file and study the general information <a class="anchor" id="1"></a>

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from functools import reduce
import math
import datetime
from scipy import stats as st
import plotly.express as px
import plotly.graph_objects as go
import copy
```

```python
data_ice = pd.read_csv('/datasets/games.csv')
```

```python
data_ice.info()
```

```python
data_ice.head()
```

```python
# Summary of the text columns, then of the numeric columns
display(data_ice.describe(include=['object']))
display(data_ice.describe())
```

At this stage, the game sales data file for 2016 was loaded and stored in a DataFrame. According to the general information, the DataFrame consists of 11 columns (5 of object type and 6 of float type) and 16715 rows. The table looks good overall; 6 columns contain missing values, which we will deal with in the next step. The table contains games of 12 different genres, and the platform column has 31 unique values. The name column contains 11559 unique names, so I find it necessary to check it for duplicates.
### 2.1. Renaming columns, dropping missing values and converting the data to the required types<a class="anchor" id="2_1"></a>

```python
data_ice.columns = data_ice.columns.str.lower()
display(data_ice.isnull().sum())
```

```python
# A release year can only be an integer, so drop the gaps and convert
data_ice.dropna(subset=['year_of_release'], inplace=True)
data_ice['year_of_release'] = data_ice['year_of_release'].astype('int')
```

```python
data = data_ice[data_ice['name'].isnull()]
data_ice.dropna(subset=['name'], inplace=True)
data_ice.shape[0]
```

### 2.2. Add column 'total sales'<a class="anchor" id="2_2"></a>

```python
data_ice['total_sales'] = (
    data_ice['na_sales'] + data_ice['eu_sales']
    + data_ice['jp_sales'] + data_ice['other_sales']
)
data_ice['total_sales'].describe()
```

### 2.3. Fill in missing values<a class="anchor" id="2_3"></a>

```python
data = data_ice[data_ice['rating'].isnull()]
display(data.head())

# Fill the remaining gaps and the 'tbd' placeholder with 0, then convert to float
data_ice = data_ice.fillna(0)
data_ice.loc[data_ice['user_score'] == 'tbd', ['user_score']] = 0
data_ice['user_score'] = data_ice['user_score'].astype(float)
data_ice.describe()
```

While processing the data, I examined the relationship between the rating, critic score, and user score columns. To receive an ESRB rating, the publisher of a game must submit an application to the organization for evaluation. I assume that games with missing values in the rating column are those whose publishers simply did not apply for a rating. I decided to fill the missing values in the user score, critic score, and rating columns with 0 so that I can continue to work with these columns. I also replaced the value TBD (to be determined) in the user score column with 0.

I changed the data type of the year of release column from float to int (since a year can only be an integer) and converted the user score column to float, since we will use its values in our research and need them to be numerical. I also decided to drop the rows with missing values in the year and name columns, since I see no way to replace them with suitable values, and they make up only 1.5% of our dataset.

I added a total sales column, which contains the sum of the sales from each region.
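Filling missing scores with 0 keeps the columns numeric, but the zeros then take part in every mean and correlation. A minimal sketch of an alternative, assuming a small made-up frame in place of `data_ice` (the values below are hypothetical): `pd.to_numeric` with `errors="coerce"` turns `'tbd'` into `NaN`, which pandas skips in aggregations.

```python
import pandas as pd

# Hypothetical toy frame standing in for data_ice; the values are made up.
toy = pd.DataFrame({
    "user_score": ["8.0", "tbd", None, "6.5"],
    "critic_score": [76.0, None, 61.0, 80.0],
})

# Coerce 'tbd' (and any other non-numeric text) to NaN instead of 0.
toy["user_score"] = pd.to_numeric(toy["user_score"], errors="coerce")

mean_with_nan = toy["user_score"].mean()              # NaN values are skipped
mean_with_zero = toy["user_score"].fillna(0).mean()   # zeros pull the mean down

print(mean_with_nan, mean_with_zero)
```

With the zero fill, the average score of the toy column drops by half, which is the kind of distortion to keep in mind when reading the score statistics later in the project.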
```python
data_pivot_by_years = data_ice.pivot_table(index='year_of_release', values='total_sales', aggfunc='sum')
data_pivot_by_years.columns = ['total_sales']

barplot = sns.barplot(x=data_pivot_by_years.index, y=data_pivot_by_years['total_sales'])
barplot.set_xticklabels(barplot.get_xticklabels(), rotation=90)
plt.xlabel('Year of release')
plt.ylabel('Total sales')
plt.title('Total sales of video games by years, USD million')
plt.show()
```

The graph above shows the total sales of video games by release year. From 2001, video game sales grew rapidly and peaked in 2008. After 2008 the popularity of video games began to decline, and in 2012-2015 total sales were approximately USD 250-350 million per year.
```python
data_pivot_by_platforms = data_ice.pivot_table(index='platform', values='total_sales', aggfunc='sum')
data_pivot_by_platforms.columns = ['total_sales']
data_pivot_by_platforms = data_pivot_by_platforms.sort_values('total_sales', ascending=False)

barplot = sns.barplot(x=data_pivot_by_platforms.index, y=data_pivot_by_platforms['total_sales'])
barplot.set_xticklabels(barplot.get_xticklabels(), rotation=90)
plt.xlabel('Platform')
plt.ylabel('Total sales')
plt.title('Total sales of video games by platforms, USD million')
plt.show()
```

This figure shows the total sales of games on each platform. The most popular platform in the whole history of video games is the PS2: total revenue from games on this platform exceeds 1200 million USD. The next most popular platforms are X360, PS3, and Wii. For my further research, I will take the two most popular platforms: PS2 and X360.
### 3.3. Research of selected platforms<a class="anchor" id="3_3"></a>

```python
platform_ps2 = (
    data_ice
    .query('platform == "PS2"')
    .pivot_table(index='year_of_release', values='total_sales', aggfunc=['median', 'sum'])
)
platform_ps2.columns = ['median_sales', 'total_sales']

barplot = sns.barplot(x=platform_ps2.index, y=platform_ps2['total_sales'])
barplot.set_xticklabels(barplot.get_xticklabels(), rotation=90)
plt.xlabel('Year of release')
plt.ylabel('Total sales')
plt.title('Total sales of video games by platform PS2, USD million')
plt.show()
```

The bar chart shows the total yearly sales of video games for the most popular platform, the PlayStation 2. Games for the PS2 were first released in 2000, and from 2001 to 2005 their sales were very high. After 2005 sales began to decline, and in 2011 they stopped altogether. So games for the PS2 were on the market for 11 years.
```python
platform_x360 = (
    data_ice
    .query('platform == "X360"')
    .pivot_table(index='year_of_release', values='total_sales', aggfunc=['sum'])
)
platform_x360.columns = ['total_sales']

barplot = sns.barplot(x=platform_x360.index, y=platform_x360['total_sales'])
barplot.set_xticklabels(barplot.get_xticklabels(), rotation=90)
plt.xlabel('Year of release')
plt.ylabel('Total sales')
plt.title('Total sales of video games by platform X360, USD million')
plt.show()
```

The bar chart shows the total yearly sales of video games for the second most popular platform, the X360. Games for the X360 were first released in 2005, and most of the sales are concentrated between 2008 and 2011, peaking in 2010 and declining afterwards until 2016 (about 12 years on the market).
```python
platform_2015 = (
    data_ice
    .query('year_of_release == 2015')
    .pivot_table(index='platform', values='total_sales', aggfunc='sum')
)
platform_2015.columns = ['total_sales']
platform_2015 = platform_2015.sort_values('total_sales', ascending=False)

barplot = sns.barplot(x=platform_2015.index, y=platform_2015['total_sales'])
barplot.set_xticklabels(barplot.get_xticklabels(), rotation=90)
plt.xlabel('Platforms')
plt.ylabel('Total sales')
plt.title('Total sales of video games in 2015, USD million')
plt.show()
```

I decided to display the most popular platforms of 2015 and compare their total sales with the total sales of the most profitable platforms (the chart below). This gives a clear idea of how video game sales have changed and which platforms are gaining popularity or are at their peak, which helps to decide which games on which platforms will be the best-selling in 2017.
```python
platform_ps4 = (
    data_ice
    .query('platform == "PS4"')
    .pivot_table(index='year_of_release', values='total_sales', aggfunc=['sum'])
)
platform_ps4.columns = ['total_sales']

platform_xone = (
    data_ice
    .query('platform == "XOne"')
    .pivot_table(index='year_of_release', values='total_sales', aggfunc=['sum'])
)
platform_xone.columns = ['total_sales']

platform_3ds = (
    data_ice
    .query('platform == "3DS"')
    .pivot_table(index='year_of_release', values='total_sales', aggfunc=['sum'])
)
platform_3ds.columns = ['total_sales']
```
```python
bar_plots = [
    go.Bar(x=platform_ps2.index, y=platform_ps2['total_sales'], name='PS2'),
    go.Bar(x=platform_x360.index, y=platform_x360['total_sales'], name='X360'),
    go.Bar(x=platform_ps4.index, y=platform_ps4['total_sales'], name='PS4'),
    go.Bar(x=platform_xone.index, y=platform_xone['total_sales'], name='XOne'),
    go.Bar(x=platform_3ds.index, y=platform_3ds['total_sales'], name='3DS')
]
layout = go.Layout(
    title=go.layout.Title(text='Total sales of selected platforms', x=0.5),
    yaxis_title='Total sales',
    xaxis_tickmode='array',
    xaxis_ticktext=tuple(data_ice['year_of_release'].values)
)
fig = go.Figure(data=bar_plots, layout=layout)
fig.show()
```

This graph shows the total sales of the two most popular platforms in the history of video games (PS2 and X360) and of the most popular platforms in 2015. Total sales of the PS4 are growing rapidly, and I believe it will remain the most popular platform in 2017.

Because the lifecycle of a platform is about 10-12 years and is getting shorter, and because the data for 2016 may be incomplete, I propose to consider data from 2011 onward (inclusive). On this basis I identified the two platforms with the greatest potential in 2017: PS4 and XOne.
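The lifecycle estimate can also be computed instead of being read off the charts. A minimal sketch, assuming a toy frame with hypothetical rows in place of `data_ice` (the years mirror those quoted in the text); the column name `span_years` is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical rows standing in for data_ice; only the two columns used here.
toy = pd.DataFrame({
    "platform": ["PS2", "PS2", "PS2", "X360", "X360", "PS4", "PS4"],
    "year_of_release": [2000, 2005, 2011, 2005, 2016, 2013, 2016],
})

# First and last release year per platform.
lifespan = toy.groupby("platform")["year_of_release"].agg(["min", "max"])

# Span between the first and the last year with releases.
lifespan["span_years"] = lifespan["max"] - lifespan["min"]

print(lifespan.sort_values("span_years", ascending=False))
```

Platforms that are still on sale in the last year of the data (PS4 in this toy example) have a truncated span, so they should be left out when averaging lifecycles.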
```python
data_ice_from_2011 = data_ice.loc[data_ice['year_of_release'] >= 2011]
```

### 3.4. The global sales of all games, broken down by platform<a class="anchor" id="3_4"></a>

```python
data_group_by_platforms = data_ice_from_2011.sort_values(by='platform')

plt.figure(figsize=(10, 8))
sns.boxplot(x='platform', y='total_sales', data=data_group_by_platforms)
plt.xlabel("Platform")
plt.ylabel("Total sales, USD million")
plt.title('The influence of the platform of video games on total sales')
data_group_by_platforms.shape[0]
```

```python
def remove_outlier(df, col):
    # Keep only the rows whose value in `col` lies within 1.5 * IQR of the quartiles
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_limit = q1 - (1.5 * iqr)
    upper_limit = q3 + (1.5 * iqr)
    out_df = df.loc[(df[col] > lower_limit) & (df[col] < upper_limit)]
    return out_df
```

```python
data_group_by_platforms_true = remove_outlier(data_group_by_platforms, 'total_sales')

plt.figure(figsize=(10, 8))
sns.boxplot(x='platform', y='total_sales', data=data_group_by_platforms_true)
plt.xlabel("Platform")
plt.ylabel("Total sales, USD million")
plt.title('The influence of the platform of video games on total sales')
data_group_by_platforms_true.shape[0]
```

In this step, I determined the limits for outliers, removed the outliers, and stored the cleaned data in a separate DataFrame. After removing the outliers, I built a chart that shows the influence of the platform on total sales. The platforms with the highest values are X360, PS3, Wii, WiiU, and XOne.
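To make the filtering rule concrete, here is a small worked example of the 1.5 * IQR rule. It is a sketch on made-up numbers, not the project data; the helper name `remove_outliers_iqr` and the parameter `k` are hypothetical.

```python
import pandas as pd

def remove_outliers_iqr(df, col, k=1.5):
    """Keep the rows whose value in `col` lies strictly inside the IQR fences."""
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df.loc[(df[col] > lower) & (df[col] < upper)]

# Made-up sales in USD million: four ordinary games and one blockbuster.
toy = pd.DataFrame({"total_sales": [0.1, 0.2, 0.3, 0.4, 10.0]})

# Q1 = 0.2, Q3 = 0.4, IQR = 0.2, so the fences are about -0.1 and 0.7.
cleaned = remove_outliers_iqr(toy, "total_sales")
print(cleaned)
```

Only the 10.0 row falls outside the fences. In sales data these "outliers" are the best-selling titles, so the filtered boxplot describes typical games rather than hits.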
### 3.5. User and professional reviews affect sales for one popular platform X360<a class="anchor" id="3_5"></a>

```python
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")
```

```python
# Work on a copy so the assignments below do not touch the source DataFrame
data_x360 = data_group_by_platforms[data_group_by_platforms['platform'] == 'X360'].copy()
data_x360.loc[data_x360['user_score'] == 'tbd', ['user_score']] = 0
data_x360['user_score'] = data_x360['user_score'].astype(float)
display(data_x360.describe(include='object'))
```

```python
fig = px.scatter_matrix(
    data_x360,
    dimensions=["user_score", "critic_score", "total_sales"],
    title="Scatter matrix correlation by scores and total sales on the platform X360"
)
fig.show()

corr_user_score = data_x360['user_score'].corr(data_x360['total_sales'])
corr_critic_score = data_x360['critic_score'].corr(data_x360['total_sales'])
print('The correlation coefficient between user scores and total sales:')
print(corr_user_score)
print('The correlation coefficient between critic scores and total sales:')
print(corr_critic_score)
```

The low correlation between user score and sales suggests that user ratings have little to do with revenue. The coefficient for critic ratings is slightly higher, and the graph also shows that games with large sales are highly rated by critics.
<div class="alert alert-success" role="alert">Reviewer's comment v. 1: Yes, we have weak correlation between variables. Please note that the correlation function shows only linear dependency between variables. Maybe this link will be interesting for you: https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/.</div>
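The reviewer's point about linear dependence can be illustrated on made-up numbers. In this sketch the data are hypothetical: sales roughly double with every step in critic score, so the relationship is perfectly monotonic but not linear. The rank correlation is computed as the Pearson correlation of the ranks, which is the definition of Spearman's coefficient.

```python
import pandas as pd

# Hypothetical scores and sales (USD million), not taken from the dataset.
toy = pd.DataFrame({
    "critic_score": [50, 60, 70, 80, 90, 95],
    "total_sales": [0.1, 0.2, 0.4, 0.8, 1.6, 6.4],
})

# Pearson measures linear association only.
pearson = toy["critic_score"].corr(toy["total_sales"])

# Spearman = Pearson correlation of the ranks; it captures any monotonic link.
spearman = toy["critic_score"].rank().corr(toy["total_sales"].rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

A gap between the two coefficients hints that the relationship is monotonic but curved, which a single Pearson value would understate.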
### 3.6. Distribution of games by genre<a class="anchor" id="3_6"></a>

```python
data_games_by_genre = data_ice_from_2011.pivot_table(index='genre', values='total_sales', aggfunc=['mean', 'sum'])
data_games_by_genre.columns = ["mean_sales", "total_sales"]
data_games_by_genre = data_games_by_genre.sort_values(by='total_sales', ascending=False)
display(data_games_by_genre)

fig, ax = plt.subplots(figsize=(10, 10))
ax.pie(data_games_by_genre['total_sales'], labels=data_games_by_genre.index, wedgeprops=dict(width=0.5))
ax.set_title('Distribution of games by genre and total sales, 2011-2016')
```

The diagram above shows the share of each video game genre on the market in 2011-2016. The most profitable genre is Action, with total sales of more than 550 million USD. In second place is Shooter, with total sales of about 400 million USD, and in third place is Role-Playing, with total sales of 245 million USD. The least profitable genres are Puzzle, Strategy, and Adventure.
At this step we found how game sales were distributed over the years (the peak of sales is 2008) and determined which platforms were the most popular in the whole history of video games (PS2 and X360) and which platform is the most profitable (X360). We determined the lifecycle of a platform (about 11 years) and also found that the lifecycle decreases over the years. I think this is due to technological progress: these days, developing a new platform takes much less time than before. We also defined the two platforms that will be popular in 2017, XOne and PS4, and, based on the platform lifecycle, chose the period for our investigation: from 2011 onward. The most popular genres are Action, Shooter, and Role-Playing.
## Step 4. Create a user profile for each region<a class="anchor" id="4"></a>

### 4.1. User profile for North American region<a class="anchor" id="4_1"></a>

```python
na_platforms = (
    data_ice_from_2011
    .pivot_table(index='platform', values='na_sales', aggfunc=['sum'])
)
na_platforms.columns = ['total_sales']
# nlargest already returns the rows sorted in descending order
na_top5_platforms = na_platforms.nlargest(5, 'total_sales')
display(na_top5_platforms)

na_genres = (
    data_ice_from_2011
    .pivot_table(index='genre', values='na_sales', aggfunc=['sum'])
)
na_genres.columns = ['total_sales']
na_top5_genres = na_genres.nlargest(5, 'total_sales')
display(na_top5_genres)
```

### 4.2. User profile for Europe region<a class="anchor" id="4_2"></a>

```python
eu_platforms = (
    data_ice_from_2011
    .pivot_table(index='platform', values='eu_sales', aggfunc=['sum'])
)
eu_platforms.columns = ['total_sales']
eu_top5_platforms = eu_platforms.nlargest(5, 'total_sales')
display(eu_top5_platforms)

eu_genres = (
    data_ice_from_2011
    .pivot_table(index='genre', values='eu_sales', aggfunc=['sum'])
)
eu_genres.columns = ['total_sales']
eu_top5_genres = eu_genres.nlargest(5, 'total_sales')
display(eu_top5_genres)
```

### 4.3. User profile for Japan region<a class="anchor" id="4_3"></a>

```python
jp_platforms = (
    data_ice_from_2011
    .pivot_table(index='platform', values='jp_sales', aggfunc=['sum'])
)
jp_platforms.columns = ['total_sales']
jp_top5_platforms = jp_platforms.nlargest(5, 'total_sales')
display(jp_top5_platforms)

jp_genres = (
    data_ice_from_2011
    .pivot_table(index='genre', values='jp_sales', aggfunc=['sum'])
)
jp_genres.columns = ['total_sales']
jp_top5_genres = jp_genres.nlargest(5, 'total_sales')
display(jp_top5_genres)
```

Top 5 platforms
```python
bar_plots_platform = [
    go.Bar(x=na_top5_platforms.index, y=na_top5_platforms['total_sales'], name='North America'),
    go.Bar(x=eu_top5_platforms.index, y=eu_top5_platforms['total_sales'], name='Europe'),
    go.Bar(x=jp_top5_platforms.index, y=jp_top5_platforms['total_sales'], name='Japan')
]
layout = go.Layout(
    title=go.layout.Title(text='Market shares of platforms for each region', x=0.5),
    yaxis_title='Total sales',
    xaxis_tickmode='array',
    xaxis_ticktext=tuple(data_ice['year_of_release'].values),
    barmode='stack'
)
fig = go.Figure(data=bar_plots_platform, layout=layout)
fig.show()
```

The bar chart above shows the market share of the 5 most popular platforms in each region. We can see a significant difference between the preferences of users from North America, Europe, and especially Japan. In Japan, the platform that is most popular among consumers in NA and Europe does not enter the top 5 best-selling platforms, and we also see 2 platforms that are not popular in NA or Europe: PSP and PSV. Three platforms are popular in all regions: PS3, PS4, and 3DS. The XOne is in the top 5 only in NA, and the PC only in Europe.
```python
bar_plots_genre = [
    go.Bar(x=na_top5_genres.index, y=na_top5_genres['total_sales'], name='North America'),
    go.Bar(x=eu_top5_genres.index, y=eu_top5_genres['total_sales'], name='Europe'),
    go.Bar(x=jp_top5_genres.index, y=jp_top5_genres['total_sales'], name='Japan')
]
layout = go.Layout(
    title=go.layout.Title(text='Market shares of genres for each region', x=0.5),
    yaxis_title='Total sales',
    xaxis_tickmode='array',
    xaxis_ticktext=tuple(data_ice['year_of_release'].values),
    barmode='stack'
)
fig = go.Figure(data=bar_plots_genre, layout=layout)
fig.show()
```

The graph above shows the top-selling genres in each region (NA, Europe, and Japan). Two genres are popular in every region: Action and Role-Playing. Gamers in Japan are not interested in genres such as Shooter and Sports, which are among the top 5 most popular genres in Europe and NA, and European gamers rarely choose the Misc genre. Instead of Misc they prefer Racing (I think this is due to the love of Europeans for racing, films about racing, etc.). Japanese players are also interested in the Fighting and Platform genres, which are not popular in other regions; I attribute this difference to the cultural characteristics of Japan.
At this step we defined the preferences of users in each region (North America, Europe, Japan) for video game platforms and genres. I think the differences are due to cultural characteristics and possibly to each state's policy toward the gaming business (for example, the PSP and PSV are platforms developed in Japan).
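The three regional profiles repeat the same aggregation, so it can be expressed once. A minimal sketch, assuming a small made-up frame in place of `data_ice_from_2011`; the helper name `top_n` is hypothetical and not part of the project code.

```python
import pandas as pd

def top_n(df, group_col, sales_col, n=5):
    """Sum `sales_col` per `group_col` and keep the n largest groups."""
    return df.groupby(group_col)[sales_col].sum().nlargest(n)

# Made-up sales in USD million, not taken from the dataset.
toy = pd.DataFrame({
    "platform": ["PS4", "PS4", "XOne", "3DS", "PSV", "PC"],
    "na_sales": [1.0, 0.5, 1.2, 0.3, 0.0, 0.1],
    "jp_sales": [0.2, 0.1, 0.0, 0.9, 0.4, 0.0],
})

# One profile per regional sales column.
profiles = {region: top_n(toy, "platform", region, n=3)
            for region in ["na_sales", "jp_sales"]}

for region, profile in profiles.items():
    print(region, list(profile.index))
```

The same helper works for genres by passing `"genre"` as `group_col`, which keeps the per-region code identical and easier to check.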
### 5.1. Average user ratings of the Xbox One and PC platforms are the same<a class="anchor" id="5_1"></a>

The null hypothesis is that the average user ratings of the XBox One and PC platforms are the same. The alternative hypothesis is that the average user ratings are different.
```python
test = copy.deepcopy(data_ice_from_2011)
xone_test = test[test['platform'].isin(['XOne'])]
pc_test = test[test['platform'].isin(['PC'])]

# Mean user score per release year for each platform
pivot_xone = xone_test.pivot_table(index='year_of_release', values='user_score', aggfunc='mean')
pivot_xone.columns = ['user_score']
pivot_pc = pc_test.pivot_table(index='year_of_release', values='user_score', aggfunc='mean')
pivot_pc.columns = ['user_score']

xone_pc = [pivot_xone, pivot_pc]
pivot_total = reduce(lambda left, right: pd.merge(left, right, on='year_of_release', how='outer'), xone_pc)
pivot_total.columns = ['user_score_xone', 'user_score_pc']
display(pivot_total)

fig = go.Figure(data=[
    go.Bar(name='XBox One', x=pivot_total.index, y=pivot_total['user_score_xone']),
    go.Bar(name='PC', x=pivot_total.index, y=pivot_total['user_score_pc'])
])
fig.update_layout(barmode='group')
fig.show()

alpha = .05
results = st.ttest_ind(pivot_xone['user_score'], pivot_pc['user_score'])
print('p-value: ', results.pvalue)
if (results.pvalue < alpha):
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")
```

We can't reject the null hypothesis: the data give no evidence that the average user ratings of the XBox One and PC platforms differ. As we can see, the type of platform does not have a strong influence on the user rating.
<div class="alert alert-success" role="alert">Reviewer's comment v. 1: Yes, there are no statistical significant differences between the average user ratings. Maybe this link will be interesting for you: https://www.analyticsvidhya.com/blog/2019/09/everything-know-about-p-value-from-scratch-data-science/</div>
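The test above compares yearly mean scores, which leaves only a handful of observations per platform. A possible variant, sketched here on made-up scores, is to compare the per-game ratings directly and to leave out games without a score. The sample values are hypothetical, and `equal_var=False` (Welch's test) is a choice made for this sketch.

```python
import numpy as np
import pandas as pd
from scipy import stats as st

# Hypothetical per-game user scores; NaN marks a game without a rating.
toy = pd.DataFrame({
    "platform": ["XOne"] * 5 + ["PC"] * 5,
    "user_score": [6.5, 7.0, 7.2, 6.8, np.nan, 6.6, 7.1, 6.7, 7.0, 6.9],
})

# One sample per platform, with unrated games dropped rather than set to 0.
xone = toy.loc[toy["platform"] == "XOne", "user_score"].dropna()
pc = toy.loc[toy["platform"] == "PC", "user_score"].dropna()

alpha = 0.05
results = st.ttest_ind(xone, pc, equal_var=False)  # Welch's t-test

print("p-value:", results.pvalue)
if results.pvalue < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")
```

Working with individual games gives the test more observations, and dropping unrated games keeps placeholder zeros out of the means.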
### 5.2. Average user ratings for the Action and Sports genres are different<a class="anchor" id="5_2"></a>

The null hypothesis is that the average user ratings of the Action genre and the Sports genre are the same. The alternative hypothesis is that the average user ratings are different.
```python
action_test = test[test['genre'].isin(['Action'])]
sports_test = test[test['genre'].isin(['Sports'])]

# Mean user score per release year for each genre
pivot_action = action_test.pivot_table(index='year_of_release', values='user_score', aggfunc='mean')
pivot_action.columns = ['user_score']
pivot_sports = sports_test.pivot_table(index='year_of_release', values='user_score', aggfunc='mean')
pivot_sports.columns = ['user_score']

action_sports = [pivot_action, pivot_sports]
pivot_total1 = reduce(lambda left, right: pd.merge(left, right, on='year_of_release', how='outer'), action_sports)
pivot_total1.columns = ['user_score_action', 'user_score_sports']
display(pivot_total1)

fig = go.Figure(data=[
    go.Bar(name='Action', x=pivot_total1.index, y=pivot_total1['user_score_action']),
    go.Bar(name='Sports', x=pivot_total1.index, y=pivot_total1['user_score_sports'])
])
fig.update_layout(barmode='group')
fig.show()

alpha = .05
results = st.ttest_ind(pivot_action['user_score'], pivot_sports['user_score'], equal_var=False)
print('p-value: ', results.pvalue)
if (results.pvalue < alpha):
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")
```

I set the parameter equal_var to False since we cannot assume that the variances of the user scores in the two samples are equal: these are two independent populations. It turns out that the user ratings for games of the Action and Sports genres do not differ significantly. This seems logical, since we are considering two of the most popular genres, which constantly compete with each other.
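To show what `equal_var=False` changes, the statistic of Welch's test can be computed by hand: each sample keeps its own variance in the denominator instead of a pooled one. The numbers below are made up for illustration and are not the project's results.

```python
import numpy as np
from scipy import stats as st

# Hypothetical user scores for two genres; not taken from the dataset.
action = np.array([7.1, 6.8, 7.4, 6.9, 7.0])
sports = np.array([5.9, 6.5, 5.2, 6.1, 4.8, 6.0])

# Sample variances (ddof=1), kept separate for each group.
var_a = action.var(ddof=1)
var_s = sports.var(ddof=1)

# Welch's t statistic: difference of means over the unpooled standard error.
t_manual = (action.mean() - sports.mean()) / np.sqrt(
    var_a / len(action) + var_s / len(sports)
)

t_scipy = st.ttest_ind(action, sports, equal_var=False).statistic
print(t_manual, t_scipy)
```

The manual value agrees with `st.ttest_ind(..., equal_var=False)`, which confirms that the parameter concerns the variances of the samples, not their means.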
We can see that the total annual revenue for console games is declining. Sales were at their peak in 2008, and now their popularity is falling. My guess is that this is due to the growing market for mobile video games (or another type of video game that is gaining popularity: virtual reality). Regionally, North America is the largest market (percentage of total games sold over the past five years), followed by Japan.

The gaming industry is quite specific, and in the graphs above we see how strongly the user's region affects preferences. The user profiles by region show that the CC platform is the most popular in Japan, while it is not even included in the top 5 most popular platforms in America and Europe. For greater success, I propose to select the purchased video games depending on the region where the store is located, in accordance with the user profiles that we have compiled.

Here are the user profiles for each region, taking into account mainly those platforms that are at the peak of popularity or are gaining popularity:

1. North America. NA has the largest share of the consumer video games market. The most popular platforms in this region are X360, PS3, PS4, XOne, and 3DS. Taking into account the research conducted, I propose to focus on sales for the PS4 and XOne platforms. The most popular genres (in descending order) are Action, Sports, Shooter, Platform, and Misc.
2. Europe. The preferences of European users do not differ much in terms of genres; the only difference is that the 5 most popular genres in Europe include Racing rather than Platform. As for platforms, we also see only one difference from the NA region: the PC instead of the XOne.
3. Japan. For this region, I propose to promote video games on the PS4 platform, continue selling 3DS games, and also support two platforms that are popular only in Japan: PSP and PSV. The most popular genre in this region is Role-Playing, followed in descending order by Action, Misc, Fighting, and Platform.